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Executive  Summary 


Imposing  nonnegative  constraints  on  least  squares  problems  arises  naturally  in  many 
fields,  usually  to  avoid  nonphysical  solutions,  such  as  negative  chemical  concen¬ 
trations  in  chemical  compositions  or  negative  pixel  intensities  in  computer  graph¬ 
ics.  The  motivation  for  this  work  comes  from  a  particular  method  for  constructing 
nonlinear  reduced-order  models  (ROMs)  where  nonnegative  constraints  on  associ¬ 
ated  least  squares  problems  are  imposed  to  maintain  the  stability  of  the  ROM.  This 
methodology  was  developed  for  finite  element  models  at  the  Army  High  Perfor¬ 
mance  Computing  Research  Center.1  The  nonnegative  least  squares  (NNLS)  prob¬ 
lem  comes  from  the  embedded  hyper-reduction  step  referred  to  as  Energy  Conserv¬ 
ing  Sampling  and  Weighting  (ECSW).  ECSW  reduces  the  complexity  of  a  ROM  by 
choosing  a  minimal  subset  of  the  finite  element  mesh,  referred  to  as  a  reduced  mesh, 
on  which  to  evaluate  the  nonlinear  function  when  integrating  the  ROM.  Choosing 
this  mesh  requires  the  solution  of  a  NNLS  problem  where  the  sparsity  of  the  solu¬ 
tion  is  associated  with  the  reduced  mesh.  Each  nonzero  entry  in  the  solution  vector 
defines  the  reduced  mesh  and  is  a  weighting  factor  for  evaluating  the  nonlinear 
function  at  that  mesh  point;  hence,  the  overall  complexity  of  the  ROM  is  tied  to  the 
sparsity  of  the  NNLS  solution.  For  this  case  the  NNLS  solution  does  not  have  to 
be  optimal,  but  it  must  satisfy  some  tolerance  associated  with  the  accuracy  of  the 
ROM. 

Solving  large  NNLS  problems  is  computationally  intensive  and  on  a  single  pro¬ 
cessor  can  sometimes  take  days.  In  many  cases,  as  with  problems  that  arise  from 
ECSW,  they  are  too  big  to  fit  on  a  single  node  so  a  parallel  implementation  is  nec¬ 
essary.  As  will  be  shown,  ECSW  NNLS  problems  are  not  solved  to  completion 
but  to  a  tolerance  to  achieve  the  goal  of  a  minimal,  reduced  mesh  with  acceptable 
accuracy.  Thus,  a  stopping  criteria  needs  to  be  imposed,  and  the  method  must  pro¬ 
duce  a  sparse  solution.  Unfortunately,  there  are  no  readily  available  parallel  NNLS 
software  packages  that  do  this. 

There  are  a  number  of  algorithms  that  solve  NNLS  problems.2  In  this  work,  2  algo- 

1  Farhat  C,  Avery  P,  Chapman  T  and  Cortiel  J.  Dimensional  reduction  of  nonlinear  finite  element 
dynamic  models  with  finite  rotations  and  energy-based  mesh  sampling  and  weighting  for  computa¬ 
tional  efficiency.  Int  J  Numer  Meth  Engng.  2014;98:625-662. 

2Chen  D,  Plemmons  R.  Nonnegativity  constraints  in  numerical  analysis.  In  proceedings  of  the 
2007  Symposium  on  the  Birth  of  Numerical  Analysis;  World  Scientific  Publishinc  Co.,  2010,  109— 
140. 

Approved  for  public  release;  distribution  is  unlimited. 


VI 


rithms  are  implemented  and  parallelized  using  ScaLAPACK.  They  are  the  Lawson 
and  Hanson  active  set  method  and  the  projected  quasi-Newton  iterative  method. 
The  projected  quasi-Newton  method  does  not,  in  general,  produce  a  sparse  solu¬ 
tion,  so  modifications  to  the  basic  method  are  made  to  promote  this.  These  methods 
and  their  implementations  are  discussed  in  Section  3.  Performance  results  for  both 
are  presented  in  Section  4.  Both  methods  were  integrated  into  the  "aeros"  finite 
element  structures  code  developed  at  the  Army  High  Performance  Computing  Re¬ 
search  Center  at  Stanford  University  by  the  Farhat  Research  Group.  This  code  is 
equipped  with  model  order  reduction  capabilities  and  ECSW  hyper-reduction.  Re¬ 
sults  using  the  2  methods  when  constructing  a  reduced-order  model  of  a  general 
V-Hull  vehicle  subjected  to  an  under-body  blast  are  presented.  For  technical  details 
of  the  code  and  its  model  order  reduction  capabilities,  see  the  works  of  Farhat  et 
al.1 
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1.  Introduction 


Nonlinear  model  order  reduction  methods  are  being  developed  at  the  Army  High 
Performance  Computing  Research  Center  (AHPCRC)  at  Stanford  University  by  the 
Farhat  Research  Group  for  the  purpose  of  enabling  parametric  studies  of  complex 
physical  processes,  such  as  under-body  blasts,  in  a  reasonable  amount  of  time.  Be¬ 
cause  of  nonlinearity,  the  complexity  of  the  reduced-order  model  (ROM)  remains 
on  the  order  of  the  high-dimensional  model  since  evaluation  of  the  nonlinear  func¬ 
tion  cannot  be  computed  off-line  during  construction  of  the  ROM;  thus,  a  second- 
tier  approximation,  known  in  the  literature  as  hyper-reduction,  is  necessary  for  fast 
evaluation  of  the  nonlinear  function  when  integrating  the  ROM.  One  such  hyper¬ 
reduction  approach  is  the  Energy  Conserving  Sampling  and  Weighting  (ECSW) 
method  developed  at  the  AHPCRC.  It  is  designed  specifically  for  finite  element 
models  and  derived  from  an  energy  conservation  argument  that  seeks  to  maintain 
the  total  energy  in  the  reduced  model  as  in  the  high-dimensional  model.  This  proce¬ 
dure  requires  a  sparse,  approximate  solution  to  a  nonnegative  least  squares  (NNLS) 
problem,  where  the  sparsity  represents  the  element  set  used  to  evaluate  the  nonlin¬ 
ear  terms,  and  the  values  are  the  weights  applied  to  these  elements.  The  sparsity 
reduces  the  overall  complexity  of  the  nonlinear  reduced-order  model,  thus  allowing 
parametric  studies  to  be  done  in  a  reasonable  amount  of  time. 

NNLS  problems  that  emerge  from  the  ECSW  hyper-reduction  procedure  are  gener¬ 
ally  large,  ill-conditioned  matrices  whose  column  dimension  is  equal  to  the  number 
of  elements  in  the  finite  element  model  and  whose  row  dimension  is  the  number 
of  basis  vectors  in  the  reduced-order  model  times  the  number  of  snapshots  used  to 
produce  the  model.  By  way  of  construction,  x  =  1  is  always  a  dense  solution  to 
these  systems;  however,  the  objective  is  to  find  a  sparse,  approximate  solution  that 
is  sufficiently  accurate.  The  solution  to  such  a  problem  on  a  single  processor  can 
take  days  and  so  the  objective  of  this  work  is  to  develop  scalable  solution  techniques 
that  are  compatible  with  the  ECSW  procedure  and  produce  solutions  quickly. 

Two  NNLS  solution  methods  are  considered:  the  Lawson  and  Hanson  active  set 
method  and  a  projected  quasi-Newton  (PQN)  iterative  method.  Active  set  methods 
are  a  natural  choice  for  use  with  the  ECSW  hyper-reduction  procedure  since  they 
typically  remove  one  or  more  variables  from  the  active  set  (i.e.,  the  set  of  optimiza¬ 
tion  variables  that  are  constrained)  so  that  a  sparse,  approximate  solution  can  be 
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achieved  with  a  simple  stopping  criterion.  Iterative  methods  on  the  other  hand,  typ¬ 
ically  do  not  do  this.  The  PQN  method  presented  here  is  modified  to  control  the  size 
of  the  active  set,  thereby  making  it  appropriate  for  hyper-reduction.  Both  methods 
are  implemented  and  parallelized  using  ScaLAPACK. 

The  general  NNLS  problem  and  some  details  about  ECSW  are  presented  in  Section 
2.  Section  3  discusses  the  Lawson  and  Hanson  (LH)  and  PQN  algorithms  and  their 
parallel  implementation.  Results  and  conclusions  are  presented  in  Sections  4  and  5, 
respectively. 

2.  Nonnegative  Least  Squares 

The  NNLS  problem1  is  a  constrained  least  squares  problem  where  all  components 
of  the  solution  vector  must  be  greater  than  or  equal  to  zero.  Given  a  matrix  A  6 
lMxiV  and  a  right-hand  side  b  e  Mm,  the  problem  is  to  find  an  x  e  Wix  that  satisfies 


2 

2 


(1) 


subject  to  x  >  0. 


It  is  well  known  that  the  solution  x  must  satisfy  the  Karush-Kuhn-Tucker  (KKT) 
optimality  conditions  given  by 


x  >  0 
V/(x)  >  0 
V/(x)Tx  =  0. 


(2) 


(Lor  proof  of  the  KKT  conditions,  see  Nocedal  and  Wright,2  chapter  12.)  Associated 
with  the  solution,  and  indeed  any  feasible  vector  x,  is  a  set  of  indices  where  x*  =  0. 
This  is  referred  to  as  the  active  set.2  If  the  active  set  at  the  optimal  solution  is 
known  a  priori,  then  solving  Eq.  1  is  equivalent  to  solving  an  equality  constrained 
least  squares  problem. 

When  the  matrix  A  comes  from  the  ECSW  hyper-reduction  procedure,  it  consists 
of  the  unassembled  elemental  contributions  to  the  reduced  vector  of  forces  to  be 
approximated,  and  the  right-hand  side  b  is  the  assembled  reduced  vector  of  forces, 
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see  Farhat  et  al.3  Each  component  of  b  is  then  the  sum  over  the  columns  of  A  for 
the  associated  row.  That  is, 

N 

bi  ^  ^  cijj .  (3) 

i= 1 

Problems  constructed  in  this  way  obviously  have  a  dense  solution  given  by  x  =  1; 
however,  the  ECSW  objective  is  to  find  a  sparse  solution  x  that  is  accurate  enough 
for  approximating  the  nonlinear  functions  in  the  reduced  model.  The  sparsity  rep¬ 
resents  the  element  set  with  which  to  evaluate  the  nonlinear  terms  and  the  values 
the  weights  applied  to  these  elements.  As  discussed  in  Farhat  et  al.,3  the  accuracy 
of  the  desired  solution  is  controlled  by  a  parameter  r  that  controls  the  residual.  The 
corresponding  approximate  problem  is 

min  /(x)  =  -  II  Ax  -  b||9 

M  ;  2  11  112  (4) 

$  =  (x£  M.n  |  ||  Ax  —  b || 2  <  r  ||b||9  ,  x  >  0}. 

The  ECSW  matrix  generally  has  a  large  condition  number  making  it  difficult  to 
solve.  As  a  result,  both  methods  are  implemented  with  an  option  to  scale  the  matrix 
columns  so  that  the  £2  norm  of  each  column  is  1. 

3.  Methods  and  Implementations 

This  section  describes  the  LH  and  the  PQN  algorithms  and  their  parallel  implemen¬ 
tations.  The  implementations  are  not  limited  to  ECSW  hyper-reduction  problems; 
they  are  both  capable  of  solving  the  usual  NNLS  problem  Eq.  1 .  A  simple  stopping 
criterion,  based  on  a  user-supplied  input  parameter,  is  added  to  both  methods  to 
promote  sparse  solutions.  This  is  sufficient  for  the  LH  method  but  not  for  the  PQN 
method.  For  the  latter  method  the  size  of  the  active  set  is  controlled  to  promote 
sparse  solutions.  This  is  described  in  Section  3.2.1. 

3.1  Lawson  and  Hanson 

The  LH  algorithm  is  an  active  set  method  that  solves  Eq.  1  in  a  finite  number  of 
iterations  by  constructing  a  sequence  of  active  sets  and  associated  feasible  vectors, 
which  converge  to  the  optimal  solution.  This  iterative  procedure  consists  of  2  loops: 
an  outer  iteration  loop  and  a  sub-iteration  loop  referred  to  as  the  downdate  proce¬ 
dure.  Each  outer  iteration  decreases  the  current  active  set  by  one  and  checks  its 
validity  by  solving  an  equality  constrained  optimization  problem.  If  the  solution  is 
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feasible,  then  the  new  active  set  is  valid  and  the  next  outer  loop  iteration  begins.  If 
not,  the  downdate  procedure  is  invoked  where  selected  optimization  variables  are 
added  back  to  the  active  set,  and  an  equality  constrained  problem  associated  with 
the  new,  larger  active  set,  is  solved.  If  this  produces  a  feasible  vector,  the  next  outer 
loop  iteration  begins,  otherwise,  the  downdate  procedure  is  repeated  until  the  solu¬ 
tion  becomes  feasible  again.  The  method  terminates  when  the  active  set  cannot  be 
decreased  without  producing  an  infeasible  solution,  or  by  some  other  criteria,  such 
as  the  tolerance  on  the  residual. 


The  implementation  is  best  described  by  reformulating  the  optimality  conditions 
Eq.  2  in  terms  of  the  active  set  as 

w  =  — V/(x)  =  AT  (b  -  Ax) 

Xi  =  0  for  i  e  Z,  Xi  >  0  for  i  e  V 
Wi  <  0  for  i  e  Z,  Wi  =  0  for  i  e  V, 

where  the  set  [1,  2, . . . ,  N]  is  partitioned  into  2  sets 

V  =  {i  |  Xi  >  0} 

The  active  set  is  Z  and  is  the  complement  of  V.  If  x,  w  6  E'V  satisfy  Eq.  5  then  x 
satisfies  Eq.  2. 


(5) 


(6) 


The  algorithm  produces  a  sequence  of  active  sets  and  solve  the  associated  equality 
constrained  minimization  problem 


min  /(x)  =  |  || Ax  —  b|| 
subject  to  Xi  =  0,  i  e  Z 


(7) 


for  each  set.  The  solution  to  Eq.  7  is  obtained  by  partitioning  A  and  x  according 
to  V  and  solving  the  unconstrained  least  squares  problem  for  the  portion  of  A  and 
x  associated  with  V.  If  %  e  RNxN  is  a  permutation  matrix  that  permutes  the  p 
columns  of  A  whose  indices  are  in  set  V  to  the  first  p  columns  of  the  matrix,  then 
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the  desired  partition  is  given  by 


A  ilp  Ap  A-z.  5  ilp x  y 


yP 

y* 


(8) 


There  is  no  required  ordering  of  the  columns  of  Ap  so  ^[p  is  not  unique.  The  permu¬ 
tation  matrix  defines  an  ordering  on  the  set  V  denoted  by  P  =  ("P,  %>). 


With  this  partition  problem  Eq.  7  is  equivalent  to  the  unconstrained  least  squares 
problem 

T  h\\AvYp-m  (9) 

with  yp  e  W\  It  is  solved  using  a  QR  factorization  of  Ap  written  as 


Ap  Qr 


R„ 


0 


,  Qp  —  Hi  Ho 


H 


pi 


(10) 


where  Qp  is  a  composition  of  elementary  reflectors  H,.  If 


Qp  b  =  g 


(11) 


then  the  solution  to  Eq.  9  is 

yP  =  Rp 1  g  p-  (12) 

The  solution  exists  if  Ap  is  linearly  independent.  The  solution  to  Eq.  7  is  then 


x  =  Ip 


yP 

y*  =  o 


and  the  residual  r  is 


r  =  b  -  Ay  =  Qp 


Op 

gz 


(13) 


(14) 


Algorithm  1  presents  the  ScaLAPACK  implementation  where  the  main  parts  are  the 
outer  iteration  loop  starting  at  line  4,  the  downdate  procedure  starting  at  line  13,  and 
the  QR  update  procedure  starting  at  line  27.  Some  comments  on  the  right  of  each 
line  contain  the  equation  number  for  reference  back  to  the  text  and,  if  appropriate, 
the  ScaLAPACK  routine  used  at  that  point.  Inputs  are  the  matrix  A,  the  right-hand 
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side  b,  optional  stopping  criteria  r  the  controls  the  residual  and  pmax  that  limits  the 
size  of  set  P,  a  flag  for  scaling  the  matrix,  and  various  ScaLAPACK  parameters 
that  control  the  processor  grids  and  block  sizes  (not  shown).  Parameter  r  stops  the 
calculation  when  the  C1  norm  of  the  residual  drops  below  r  ||b||2  and  pmax  stops  the 
calculation  if  the  size  of  set  P  reaches  pmax.  If  scaling  is  selected,  then  the  matrix 
A  is  postmultiplied  by  a  diagonal  scaling  matrix  \s  whose  diagonal  entries  are  the 
inverse  of  the  C2  norm  of  each  column  vector.  The  Output  is  the  solution  vector  x. 


After  initialization,  the  outer  iteration  loop  is  entered  where  r,  w,  and  Wimax,  the 
maximum  component  of  w  over  indices  in  set  Z ,  are  computed  (lines  5,  6  and  7). 
If  wimax  >  0,  then  there  is  a  potential  for  finding  a  new  feasible  vector  that  reduces 
the  objective  function  /(x)  so  index  imax,  the  most  likely  candidate  for  removing 
from  the  active  set,  is  moved  to  P.  If  wimax  <  0  then  the  calculation  is  complete. 
Equation  9  is  then  solved  for  yp,  and  if  the  minimum  component  of  yp  is  strictly 
greater  then  zero,  then  y  =  [yp,  0 1 7  is  feasible  and  the  current  active  set  is  valid. 


If,  on  the  other  hand,  any  component  of  yp  is  less  than  or  equal  to  zero,  the  current 
active  set  is  not  valid  and  the  downdate  procedure  is  invoked.  First,  the  previous 
solution  y  is  updated  using  the  new  but  infeasible  yp  as  follows: 


y  =  y  +  a  (yp  -  y) ,  yp 


y pi  ^  ( N—p ) 


a  =  min 


VPi  <  0 ,i  e  {1.2,..../;} 


(15) 


This  choice  of  a  ensures  y  remains  feasible  and  has  at  least  one  component  equal 
to  zero  with  an  index  in  P.  This  component  is  moved  from  set  P  to  Z  along  with 
any  others  that  are  zero.  Equation  9  is  solved  again  and  the  process  repeats  until  a 
new  active  set  is  found  that  produces  a  feasible  yp. 


The  QR  factorization  of  Ap  is  required  in  both  the  outer  iteration  loop  and  the 
downdate  procedure.  The  outer  loop  constructs  Ap  from  Ap_i  by  adding  a  new  col¬ 
umn  vector  at  index  p  and  the  downdate  procedure  removes  one  or  more  columns. 
Because  it  is  stored  compactly  consistent  with  ScaLAPACK,  the  factorization  can 
be  incrementally  updated  by  computing  and  applying  new  elementary  reflectors 
where  appropriate.  If  the  only  change  is  a  new  column  vector  at  index  p,  then  only 
H,,  in  Eq.  10  needs  to  be  computed.  If  one  or  more  column  vectors  are  removed 
and  qmm  is  the  smallest  index  in  the  ordered  set  P  of  the  vectors  that  are  to  be 
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removed,  then  only  elementary  reflectors  Hqrnin.Hqrnin+1, ...,  Hp  need  to  be  recom¬ 
puted.  Maximizing  qmin,  therefore,  reduces  the  work  associated  with  the  downdate 
procedure. 

Updating  the  QR  factorization  is  done  in  procedure  UpdateQR(A).  It  starts  by  copy¬ 
ing  the  columns  of  A  associated  with  indices  k  through  p  of  Ap  to  the  appropriate 
sub-matrix  of  the  ScaLAPACK  matrix  that  holds  the  factorization  represented  by 
[qi,  q2, . . . ,  qp]  in  Algorithm  1.  The  ScaLAPACK  routine  pdormqr  is  called  at  line 
32  to  multiply  the  sub-matrix  [qfe,  q^+i, . . . .,  qp]  by  Qfc_i,  followed  by  a  call  to 
pdgeqrf  at  line  33  to  complete  the  factorization.  The  vector  g  is  either  incremen¬ 
tally  updated  at  line  35  or  recomputed  from  b  at  line  37.  It  is  worth  noting  that  the 
storage  requirement  for  holding  the  factorization  is  0(M  x  M);  however,  if  the 
optional  stopping  criteria  pmax  is  specified,  then  the  storage  requirement  is  reduced 
to  0(M  x  pmax ). 

When  the  downdate  procedure  increases  the  active  set,  or  equivalently  removes 
column  vectors  from  Ap,  the  new  ordering  imposed  on  the  columns,  although  not 
unique,  could  have  an  impact  on  performance  through  the  value  of  qmin  at  each 
iteration.  Two  orderings  were  implemented.  The  first  fills  in  the  holes  with  column 
vectors  that  were  added  most  recently,  and  the  second  shifts  all  columns  to  the  left. 
The  shift  ordering  performed  better  most  likely  because  any  column  vector  that  is 
associated  with  the  optimal  active  set  tends  to  migrate  to  the  lower  column  indices 
of  matrix  Ap,  hence  maximizing  qrnin  on  subsequent  iterations. 

The  processor  grid  used  by  ScaLAPACK  to  distribute  the  matrices  and  parallelize 
the  computations  can  have  a  large  influence  on  performance.  Using  the  assumption 
that  Ap  e  R Mxp  where  p  <U  M,  the  most  efficient  processor  grid  for  the  QR  fac¬ 
torization  is  Nprocs  x  1,  where  Nprocs  is  the  number  of  processors.  This  assumption 
is  true  for  the  early  part  of  any  calculation  when  p  is  small  and  is  also  true  for  the 
optimal  solution  of  many  problems;  however,  it  is  not  true  in  all  cases.  But,  since 
the  goal  of  ECSW  hyper-reduction  step  is  to  produce  a  sufficiently  optimal  solution 
with  p  <C  M,  the  assumption  is  valid  here. 

For  large  matrices  the  most  expensive  part  of  the  algorithm  is  computing  w.  Be¬ 
cause  the  optimal  processor  grid  for  the  QR  factorization  is  not  necessarily  the 
best  for  computing  w,  2  separate  processor  grids,  or  BLACS  contexts,  were  imple¬ 
mented.  One  context  is  used  for  the  QR  factorization  and  all  computations  aligned 
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with  it,  while  the  other  is  used  for  computations  aligned  with  the  matrix  A.  It  was 
found  that  the  overhead  associated  with  multiple  BLACS  contexts  generally  ex¬ 
ceeded  the  benefit,  so,  in  general,  the  single  processor  grid  Nprocs  x  1  is  used  for 
both. 

Not  shown  in  Algorithm  1,  but  implemented  in  the  code,  is  a  check  on  linear  inde¬ 
pendence  of  A p  as  each  new  column  vector  is  added.  It  closely  follows  the  original 
implementation  in  Lawson  and  Hanson.4  It  is  rarely  executed  so  it  was  not  included 
in  Algorithm  1 . 
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Algorithm  1  Parallel  LH  (PLH)  NNLS  Algorithm 

Input: 

A  G  MMx7V,  b  G  MM,  t,  pmax ,  scale 

Output: 


1:  Initialize:  x  =  y  =  0,  V  =  0,  p  =  0,  g  =  b,  Q0  =  I 
2:  if  scale  then 
3:  A  i —  A  A.s 

4:  repeat 

r  i  t 

5:  r  <—  Qp  [0P,  g(N_p) 

6:  w  i - V/(x)  =  A7  r 

7;  wimax  =  max  {wi\i  =  1,2,  ...N,  i^V} 

8:  if  >  0  then 

9:  p-<-p+l,  'P(p) 

max 

10:  UpdateQR(p) 

11:  Solve  Rpyp  =  gp 

12:  y„un  =  min  {yPi  \  i  =  1,  2,  ...p} 

13:  while  <  0  do 

14:  a  =  min  |^r  I  VPl  <  0, ®  G  {l,2,...,p}| 

15:  y  <-  y  +  a(yP  -  y),  where  yp  =  yp,  0(JV_p. 

16:  V0  =  {P{i)  |  yi  =  0,  i  =  1,2,  ...,p} 

17:  qmin  =  min  i  G  P0 

18:  T  =  V\V0,  P<-\V\ 

19:  UpdateQR(f/mm) 

20:  Solve  Rp  yp  =  gp 

21:  ymin  min  {pPi  |  i  1)2,  .  ..p} 

r  i T 

22:  %  [yP,0(M_p) 

23:  until  Wi  <  0  Vi  G  Z  or  p  =  pmax  or  ||r||2  <  r  ||b||2 
24:  if  scale  then 
25:  x  G-  Asx 

26:  Return  x 


>  pdnrm2 

>  Eq.  14,  pdormqr 
>  Eq.  5,  pdgemv 


>  Eq.  12,  pdtrsv 


>  Downdate 


>  Eq.  12,  pdtrsv 
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27 

procedure  UpdateQR(/c) 

28 

for  i  =  k  — >  p  do 

29 

3  =  V{i) 

30 

q,  &j 

31 

if  k  >  1  then 

32 

[qfcj  qjfc+1,  ■  •  •  qp]  •<—  Hfc_2, . .  EQ  [qfc,  q^+i, . 

.  qp]  >  pdormqr 

33 

[qfc,  qfc+i,  •  ••  tjp]  <-  QR(fc,[qjfe,  q*+i,  •••  qP]) 

>  pdgeqrf 

34 

if  k  —  p  then 

35 

g  <-  Hp  g 

>  pdormqr 

36 

else 

37 

g  QPb 

>  pdormqr 

3.2  Projected  Quasi-Newton 

The  PQN  algorithm  is  an  iterative  method,  not  an  active  set  method,  and  as  such 
does  not  lend  itself  to  a  threshold  based  termination  criterion  that  encourages  a 
sparse  solution  as  required  by  the  ECSW  hyper-reduction  procedure.  A  simple  mod¬ 
ification  that  controls  the  size  of  the  active  set  makes  it  usable  for  ECSW  and  results 
in  an  algorithm  that  is  referred  to  here  as  the  Limited  PQN  (LPQN)  method.  The 
parallel  implementation  of  the  original  method  is  described  first  followed  by  Sec¬ 
tion  3.2.1,  which  describes  the  LPQN  modifications.  The  reader  is  referred  to  Kim 
et  al.5  for  details  about  the  algorithm  and  its  convergence  properties. 

The  PQN  algorithm  is  an  iterative  procedure  that  produces  a  sequence  of  feasible 
vectors  {xfc},  k  =  0, 1,  •  •  • ,  x/,:  e  K'V,  that,  under  certain  conditions,  converge  to 
the  solution  of  Eq.  1.  It  is  initialized  with  any  feasible  vector,  usually  x°  =  0.  At 
each  iteration  the  components  of  xA:  are  partitioned  into  sets  of  free  and  fixed  vari¬ 
ables  where  the  fixed  variables  are  constrained  to  zero.  A  NNLS  problem  is  con¬ 
structed  for  the  free  variables  and  the  current  iteration  operates  on  it.  The  updated 
free  variables  combined  with  the  fixed  variables,  produces  the  next  iterate  xfc+1. 
The  projection  step  guarantees  feasibility.  Once  the  optimal  active  set  is  reached, 
the  method  reduces  to  solving  an  unconstrained  least  squares  problem  for  the  free 
variables. 

The  implementation  is  described  from  the  point-of-view  of  the  kth  iteration  where 
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xfc  and  rk  are  known  and  iterate  xfc+1  and  residual  rA:+1  are  computed.  The  iteration 
starts  by  partitioning  xfc  into  sets  of  fixed  and  free  variables  as, 

Ztmn={,\x>‘  =  0,  [V/(x‘)].>0}  (16) 

'Ptmn  =  {l,2,...,N}\Zmn, 

where  subscript  i  is  the  component  index.  Because  of  the  positivity  constraint  on 
the  gradient,  the  fixed  set  Zkpqn  is  a  subset  of  the  active  set.  The  free  set  Vkpqn  is  the 
complement  of  Zkqn.  Only  the  free  variables  are  updated  during  the  iteration.  The 
fixed  variables,  appropriately  named,  are  not. 

The  partition  induces  a  NNLS  problem  on  the  free  variables.  For  convenience,  the 
matrix  is  permuted  so  that  the  free  variables  are  associated  with  a  contiguous  block 
of  column  vectors  starting  at  column  1,  so  the  NNLS  problem  description  is  given 
in  terms  of  the  permuted  matrix  Afc+1  used  to  compute  xfc+1.  If  Pfc  e  MJVxJV  is  a 
permutation  matrix  that  permutes  the  pk  =  \  Vk)qn  |  columns  of  a  matrix  whose 
indices  are  in  set  Vkqn  to  the  first  pk  columns,  then  the  partitioned  matrix  can  be 
written  as 


^fc+l  _  p0  pi  pfc  __  p k 


A 


k+ 1 
' Pk  ’ 


(17) 


where  the  subscripts  pk  and  zk  denote  the  partitioning  based  on  Vkpqn  and  Zkpt 
respectively,  and  A0  =  A.  The  resulting  problem  in  the  free  variables  is 


arg  min 

Z  G  ,  z>0 


Ak+1z 

Pk 


(18) 


The  feasible  vector  xfc  and  the  gradient  V/  (xfc) ,  which  are  aligned  with  the  matrix 
Ak  from  the  previous  iteration,  are  permuted  to  align  with  the  matrix  AA"+ 1  and  are 
given  by 

yk  —  p  k1  xfc  _ 

V/  (yfc)  =  PfcT  V/  (xfc)  = 

The  notation  is  such  that  the  partition  of  the  gradient  follows  that  of  the  argument 


V/  (y\ 
v/  (y\ 
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to  /;  therefore,  V/  (yA:)  is  aligned  with  yk.  The  feasible  vector  at  the  end  of  the 
k  +  1th  iteration  is  written 

xfc+1 

xfc+i  =  pk 

yk  =0 
L  J  zk 

where  x;' + 1  is  an  approximate  solution  to  Eq.  18  given  by  the  current  PQN  iteration. 

Pk 

The  PQN  algorithm,  as  presented  by  Kim  et  al.,5  uses  a  line  search  method  and  a 
projection  that  guarantees  feasibility.  The  iteration  update  is  computed  as 

xfc+ 1  =  yk  +  ak  dfc,  (21) 

pk  Pk  ’  ' 

where  the  search  direction,  d)'  is  given  by 

(/?;<)-<■  (22) 

Parameters  a  and  (3  define  the  line  search  method.  Projection  yk  is 

T‘(ftyr‘J=p[y‘t  -/JS^vs(^)].  (23) 

where  P  is  the  projection  onto  and  the  matrix  is  the  principal  submatrix 
of  an  approximation  to  the  inverse  of  the  Hessian,  V2/(xfc)  1  suitably  permuted. 

Two  line  search  methods  were  discussed  in  Kim  et  al.5  and  both  were  implemented. 
The  first  is  the  limited  minimization  method  that  fixes  (3  and  computes  a  as  the 
minimum  of  g(  yk  -fa  dfc)  over  the  variable  a.  The  resulting  line  search  parameters 
are 

dkTAk+iT  [b  -  Ak+1yk 

ak  = - — - ± - P- — /3  =  fixed  constant.  (24) 

Afc+1dfc 

Pk  2 

The  convergence  proof  in  Kim  et  al.5  requires  (3  to  be  sufficiently  small.  The  imple¬ 
mentation  fixes  (3=1  but  reduces  it  if  dA  is  not  a  descent  direction. 

The  second  line  search  method  used  is  Armijo  along  projection  arc.  This  method 
fixes  a  =  1  and  computes  (3  as 

(3  =  sma,  (25) 
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where  parameters  s  e  (0, 1)  and  a  >  0  are  fixed  and  m  is  the  smallest  nonnegative 
integer  for  which 

/  (<  )  -  s  (t *  y*  J  )  >  W/+1  (y‘„  )  T  (y‘„  -  7*  («"V,  <  )  )  • 

(26) 

Parameter  g  e  (0,  |)  is  also  fixed. 

The  Broyden-Fletcher-Goldfarb-Shanno  (BFGS)  method  of  approximating  the  in¬ 
verse  Hessian  Sfc  requires  storing  a  dense  matrix  of  size  N2,  which  is  prohibitively 

Pk 

large  so  the  limited  memory  BFGS,  or  L-BFGS,  as  given  in  Nocedal  and  Wright’s2 
work,  is  used  instead.  L-BFGS  is  a  recursion  algorithm  that  approximates  the  prod¬ 
uct  Sk  Vg  ( yk  )  and  is  given  in  Algorithm  3.  It  computes  the  approximation  to 
the  inverse  Hessian  using  a  fixed  number,  K ,  of  difference  quantities  for  both  the 
feasible  vector  and  the  gradient  of  the  objective  function.  Since,  in  general,  a  permu¬ 
tation  is  applied  at  each  iteration,  these  difference  quantities  are  also  be  permuted 
and  are  defined  as 

uj  =  PkT  PkT  ...  Pj+lT  [xi+1  -  y'; 

L  J  (27) 

wJ  =  pkT  pkT  _  _  _  Pi+iT  [v/(x2+1)  -  V/(y7)]  , 

where  k  is  the  current  iteration  counter  and  j  e  [0, 1, . . . ,  K  —  1]  is  the  index  of  the 
saved  difference  quantities.  These  vectors  are  saved  in  a  round-robin  fashion  with 
index  j  =  l  representing  the  latest  vector  stored. 

The  PQN  implementation  with  the  limited  minimization  line  search  method  is  given 
in  Algorithm  2.  Inputs  are  the  matrix  A,  right-hand  side  b,  a  flag  to  scale  matrix  by 
the  diagonal  matrix  As,  which  scales  A  so  that  the  C~  norm  of  each  column  is  1, 
and  stopping  criteria  r,  which  stops  the  calculation  when  ||rfc+1||9  <  r  ||r°||2,  and 
ztoi ,  which  stops  the  calculation  when  the  active  set  has  been  found  and  norm  of  the 
gradient  associated  with  the  free  variables  drops  below  ztoi- 

Steps  5-7  compute  the  gradient,  and  if  at  least  one  iteration  is  performed  stores 
A  V/.  Steps  8-13  construct  the  set  Vkpqn  and  apply  the  associated  permutation  ma¬ 
trix.  Step  14  computes  the  approximation  to  the  inverse  Hessian  times  the  gradient 
using  the  L-BFGS  method.  Steps  15-22  are  the  PQN  iteration  using  limited  mini¬ 
mization  with  (5  =  1.  This  choice  of  /3  does  not  always  produce  a  descent  direction, 
so  steps  17-19  are  executed  until  it  does.  The  round-robin  counter  /  is  updated  in 
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step  23  and  Ax  is  saved  step  24.  The  iteration  is  complete  after  computing  the 
residual  at  step  25.  When  the  algorithm  terminates,  the  solution  is  permuted  back 
to  its  original  order  at  step  28.  Algorithm  3  was  taken  directly  from  Nocedal  and 
Wright’s2  book.  It  is  included  for  completeness  but  is  not  discussed. 

Both  algorithms  require  primarily  matrix-vector  multiplications,  dot  products  and 
permutations,  making  it  ideally  suited  for  parallelizing  with  ScaLAPACK.  Routines 
pdgemv,  pddot,  pdgeadd,  and  pdlapiv  were  used  for  this  purpose. 
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Algorithm  2  PQN  with  Limited  Minimization 

Input: 

A  e  RMxN,  b  €  MM,  scale,  r,  ztoi. 

Output: 


Initialize:  k  =  0,  l  =  —  1,  x°  =  0,  r°  =  — b,  A0  =  A,  V„nn  =  0- 

P'4 11 

if  scale  then 
A  <-  A  As 
repeat 

V/  (xfc)  <-  AfcV; 

if  k  >  0  then 

w*  <-  V/(xfc)  -  V/(yfc-1) 

Construct  set  Vkqn  and  associated  permutation  matrix  Pfc 
\k+l  \k-pk 

k  ^ _  p k^ ^k 


V/  (yfc)  <-  PfcTV/  (xfe) 


if  k  >  0  then 

td  <—  PkI  td;  wJ  Pfc7  wJ  ;  j  =  0, 1, . . .  min(/x ,  k)  —  1 
S^V/  (y^  =  LBFGS(/,min(K,  fe)) 

/?t  1 

d  <-  P  |yfc  -p  Sk  V/  (yfc)  1  -  yk 

r  Pfc  pfc  v  ;pfc  J  Pfc 

while  dr  V/(yfc)  >  0  do 

d<-pfyfc  -p  Sfc  V/  (yfc)  1  -  yfc 

L  Pfc  Pfc  v  yPfc  J  Pfc 

dTV/(yfc)p, 


20!  G  ^ — - 7T" 

l|Afcd||2 

21:  mid(0,  cc,  1) 


22:  xfc+i  <r- 


yk  +  ak  d 
JPfc 


l  =  (l  +  1)  mod  K 
u 1  «—  xfc+1  —  yk 

Y*k-\- 1  ^ _ 1^ 

k  A:  +  1 

until  1 1 rfc 1 1 2  <  t  ||r°||2  or  V/(x^+1)  ^  <  ztoi 
Return  [l1^2  . . .  1fc+1]  xfc+1 
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Algorithm  3  L-BFGS  2-loop  recursion 
29 :  procedure  L  B  FG  S  (l ,  m) 

30:  if  l  <  0  then 

31:  Return  V/(yfc) 


q<-v/(y*)Pj+i 

for  i  =  0,  1,  ...  m  —  1  do 

j  =  (l  —  i)  mod  m 

1 

Pj  ~  (  \T  • 

V  Pk+i)  Pk+i 
/  \  T 


^  =  Pj 


q  -e-  q  -  Vj  wJ 


if  m  =  0  then 


for  z  =  0,  1,  ...  m  —  1  do 

j  =  (l  +  1  +  i)  mod  m 

>l.i  <  P.i  { wt  )  s 


s  +  11 1  (vj  ~  Pj) 


46:  return  s 


3.2.1  Limited  PQN  (LPQN)  Method 

The  LPQN  algorithm  is  identical  to  the  original  algorithm  with  the  exception  that  a 
limit  is  imposed  on  the  number  of  free  variables.  This  limit  can  be  global,  variable 
with  each  time  step  or  both.  Assuming  the  imposed  limit  at  time  step  k  +  1  is  n/fc+1 
the  set  of  free  variables  is  given  by 


-pfc  _  ppfc-i  \  2jk  1  i  j  T>k 

'  Ipqn  lr  Ipqn  \  ^ Ipqn J  ^  r  -\-i 
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where 


■Plat ct  =  {i  I  *  e  Vtm  \  2Un,  [v/(xfc)] .  <  0} 
[V/(x*)],  <  [V/fx4)]^  Vr  6  Vl  k  Vj  6  VkM  \  V\ 

I  'Tyk  I  k  _  I  'Tyk—l  \  k  [ 

I  '  +|  —  1 5  max  I  '  Ipqn  \  ^Ipqn  |  * 


k 

select 


max 


(29) 


The  free  variables  at  time  step  k  +  1  are  then  the  free  variables  at  the  previous  step 
provided  they  have  not  moved  to  set  Z^pqn,  plus  any  additional  variables  whose 
gradients  are  the  most  negative  up  to  a  total  limit  of  rifk  ax.  This  is  implemented  by 
sorting  in  increasing  order  the  components  of  V/(xfc)  that  are  in  Vkselect  and  adding 
the  appropriate  number  of  them  to  Vkpqn  so  as  not  to  exceed  the  limit. 

The  convergence  proofs  in  Kim  et  al.’s5  report  are  valid  for  the  LPQN  method  pro¬ 
vided  the  limit  imposed  is  greater  than  or  equal  to  the  number  of  free  variables  in 
the  optimal  solution. 

4.  Results 

This  section  presents  results  using  the  PLH  and  PQN  methods  implemented  as  dis¬ 
cussed  in  the  previous  section.  The  codes  are  first  validated  on  a  number  of  small, 
random  NNLS  problems  by  comparing  the  optimal  solutions  generated  by  each 
to  the  original  LH  code.6  Scalability  results  are  then  presented  on  a  larger  ran¬ 
dom  matrix,  again  solving  for  the  optimal  solution.  Finally,  the  codes  are  applied 
to  NNLS  problems  associated  with  the  ECSW  hyper-reduction  step  for  a  reduced- 
order  model  associated  with  the  dynamic  response  of  a  generic  V-hull  vehicle  to  an 
under-body  blast. 

All  calculations  were  run  on  haise.navo.hpc.mil  or  kilrain.navo.hpc.mil,  which  are 
nearly  identical  435  TFLOPS  IBM  iDataPlex  HPC  systems  located  at  the  Navy 
Distributed  Computing  Research  Center  in  Mississippi.  Each  has  1,220  standard 
memory  nodes  with  27  GB  of  accessible  memory  per  node  and  2  Intel  Xeon  Sandy 
Bridge  E5-2670  processors  per  node  with  8  cores/processor.  Wall  clock  times  are 
measured  with  the  "gettimeofdayO"  function  and  only  include  the  times  to  solve  the 
NNLS  problems.  Times  for  constructing  or  reading  the  matrices  are  not  included. 
See  Navy  DSRC7  for  more  details  on  these  systems.  All  calculations  were  done  in 
double  precision. 
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The  solutions  are  validated  on  3  small  random  NNLS  problems  with  matrices  G 
RMx  A'  where  M  <  N ,  M  =  N  ,  and  M  >  N  by  considering  the  solutions  from 
the  original  LH  code  as  exact.  The  first  set  of  problems  are  randomly  generated  so 
that  the  off-diagonal  elements  of  the  matricies  and  the  components  of  the  right-hand 
side  are  in  [0,  1],  and  the  random  diagonal  elements  of  the  matrices  are  in  [1,  10]. 
These  positive  matrices  produce  large  active  sets,  equivalently  a  small  number  p  of 
free  variables,  and  the  PLH  algorithm  requires  very  few  entries  into  the  downdate 
procedure.  The  PLH  code  and  3  variations  of  the  PQN  code  are  tested:  the  original 
algorithm  with  no  active  set  limiting,  the  LPQN  and  a  limited  projected  Newton 
(LPN)  method,  which  is  the  LPQN  but  with  the  exact  inverse  Hessian.  For  large 
active  sets,  it  can  be  advantageous  to  compute  the  exact  inverse  Hessian  instead  of 
approximating  it  as  done  in  Algorithm  3.  For  method  LPN,  the  call  to  L-BFGS  at 
line  14  of  Algorithm  2  is  replaced  with  a  call  to  a  similar  subroutine  that  uses  the 
exact  inverse  hessian. 


For  this  first  set  of  NNLS  problems,  the  relative  errors  e *  between  the  solution  x* 
from  each  code  and  the  LH  solution  xlh  is  computed  by 


e 


* 


(30) 


and  given  in  Table  1.  The  PLH  solutions  are  exact  to  machine  accuracy  as  expected. 
The  errors  in  the  PQN  and  LPQN  solutions  are  O (10  7)  -  O(10-8).  When  the  so¬ 
lution  reaches  this  level  of  accuracy  the  PQN  method  slows  down  and  additional 
iterations  are  not  cost  effective  so  the  calculation  is  stopped.  Continuing  to  iterate 
or  solving  a  least  squares  problem  based  on  the  free  variables,  which  at  this  point  is 
the  complement  of  the  optimal  active  set,  produces  a  relative  error  within  machine 
accuracy,  but  neither  is  necessary  or  cost  effective.  (The  code  has  options  to  do 
both.)  LPN  solutions  are  exact  to  machine  accuracy.  This  is  because  the  objective 
function  is  quadratic,  and  once  the  correct  active  set  is  found,  the  LPN  method  is 
exact.  Parallel  results  on  more  than  one  processor,  not  shown  in  tables,  have  similar 
relative  errors. 


The  single  processor  wall  clock  times  in  Table  2  correspond  to  the  calculations 
reported  in  Table  1 .  Compared  to  LH,  PLH  code  is  about  4  times  faster,  PQN  is  on 
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average  6.3  times  faster,  and  LPQN  is  on  average  14.3  times  faster.  LPQN  and  LPN 
used  a  global  limit  of  1,000  free  variables.  The  column  labeled  CN  is  the  matrix 
condition  number. 

Table  1  Relative  error  of  PLH,  PQN,  LPQN  and  LPN  solutions  when  compared  to  the  orig¬ 
inal  LH  code.  All  calculations  are  in  double  precision.  LPQN  and  LPN  calculations  used  a 
global  limit  of  1,000  free  variables. 


Matrix 

^ plh 

e 

pqn 

^ Ipqn 

^ Ipn 

7,000  x  10,000 

4.0  x  10"14 

5.2  x  10~8 

6.0  x  10"8 

1.2  x  KT14 

10,000  x  7,000 

2.2  x  10"14 

2.3  x  10~8 

6.5  x  10"7 

1.6  x  10~14 

20,000  x  20,000 

3.4  x  10"14 

2.2  x  10“7 

7.0  x  10"8 

1.3  x  10”14 

Table  2  Single  processor  wall  clock  time  in  seconds  corresponding  to  the  calculations  re¬ 
ported  in  Table  1.  The  column  labeled  CN  is  the  matrix  condition  number. 


Matrix 

CN 

Wall  clock  time  (s) 

LH 

PLH 

PQN 

LPQN 

LPN 

7,000  x  10,000 

8.65  x  102 

45.9 

11.4 

8.6 

3.9 

3.1 

10,000  x  7,000 

8.72  x  102 

50.1 

13.7 

8.5 

5.7 

4.9 

20,000  x  20,000 

1.90  x  107 

412.6 

92.6 

53.0 

23.0 

11.9 

Table  3  shows  the  number  of  iterations  needed  by  each  method  to  reach  the  optimal 
solution.  The  column  labeled  p  is  the  number  of  free  variables.  For  the  PLH  method, 
the  number  of  iterations  that  exceeds  p  is  an  indication  of  the  number  of  times 
the  downdate  loop  was  executed.  For  these  matrices,  execution  of  the  downdate 
procedure  was  minimal. 


Table  3  Number  of  iterations  required  for  each  method  corresponding  to  the  calculations 
reported  in  Table  1.  The  column  labeled  p  is  the  number  of  free  variables  in  the  optimal 
solution. 


Matrix 

P 

Iteration  count 

PLH 

PQN 

LPQN 

LPN 

7,000  x  10,000 

250 

252 

235 

152 

44 

10,000  x  7,000 

275 

277 

222 

178 

43 

20,000  x  20,000 

379 

379 

312 

197 

47 

Additional  validations  are  done  using  3  matricies  of  the  same  size  as  before  but  this 
time  the  off-diagonal  elements  and  the  components  of  the  right-hand  side  are  in 
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[—1,  1].  The  diagonals  elements  are  as  before.  These  matrices  are  reasonably  well- 
conditioned,  produce  small  active  sets  (equivalently  large  p ),  and  the  PLH  method  in 
one  case  requires  significantly  more  executions  of  the  downdate  loop  than  required 
by  the  positive  matrices.  The  single  processor  wall  clock  time  in  seconds  and  the 
iteration  counts  are  in  Tables  4  and  5,  respectively.  Here  the  benefit  of  PQN  is  clear. 
It  is  better  than  2  orders  of  magnitude  faster  than  LH.  For  these  problems  PLH 
is  significantly  slower  than  LH,  but  this  is  as  expected  since  it  violates  the  design 
assumption  that  p  <C  N.  The  downdate  costs  are  high  in  first  of  the  3  problems. 
As  before,  the  computed  active  sets  are  identical  for  each  method,  and  the  relative 
errors  similar  to  the  positive  matrices.  When  p  becomes  large,  computing  the  exact 
inverse  Hessian  is  expensive,  as  expected.  Also,  when  p  is  large,  limiting  the  size  of 
the  free  set  is  not  appropriate,  so  no  such  calculations  were  made  for  this  case. 


Table  4  Single  processor  wall  clock  times  for  the  LH,  PLH  and  PQN  methods  for  3  random 
matrices  where  the  off-diagonal  entries  and  the  right-hand  sides  are  between  -1.0  and  1.0 
and  the  diagonal  entries  are  between  1  and  10.  The  column  labeled  CN  is  the  matrix  condition 
number. 


Matrix 

CN 

Wall  clock  time  (s) 

LH 

PLH 

PQN 

PN 

7,000  x  10,000 

1.13  x  101 

539.7 

2,154.0 

4.9 

747.1 

10,000  x  7,000 

1.13  x  101 

571.3 

632.6 

2.0 

78.1 

20,000  x  20,000 

2.84  x  104 

6.607.0 

15.884.4 

12.3 

2,327.4 

Table  5  Number  of  iterations  required  for  each  method  corresponding  to  the  calculations 
reported  in  Table  4.  The  column  labeled  p  is  the  number  of  free  variables  in  the  optimal 
solution. 


Matrix 

P 

Iteration  count 

PLH 

PQN 

PN 

7,000  x  10,000 

5,060 

5,468 

100 

68 

10,000  x  7,000 

3,512 

3,518 

35 

25 

20,000  x  20,000 

10,185 

10,321 

52 

35 

The  wall  clock  times  in  Table  2  show  that  the  PLH  code  on  a  single  processor 
performs  about  4  times  faster  than  the  original  LH  code  when  p  ^  N;  however, 
Table  4  shows  the  opposite  when  this  assumption  is  violated.  Both  codes  implement 
the  same  algorithm;  however,  the  implementation  differs  in  a  significant  way.  The 
original  code  applies  each  elementary  reflector  to  the  entire  matrix  but  the  PLH 
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code  applies  them  only  to  Ap.  When  the  downdate  procedure  is  invoked,  the  LH 
code  applies  Givens  rotations  to  move  the  offending  column  vector  back  to  the 
active  set,  whereas  the  PLH  code  rebuilds  the  QR  factorization  from  the  point  of 
removal.  This  can  be  costly  or  cheap  depending  on  the  location  of  the  column  vector 
that  needs  to  be  removed. 

There  are  also  storage  implications  in  each  of  these  designs.  The  storage  require¬ 
ment  for  the  PLH  code  is  greater  because  it  keeps  a  copy  of  the  original  matrix 
and  operates  on  the  matrix  Ap.  This  is  mitigated  by  the  assumption  that  p  <C  N 
and  so  the  storage  requirement  for  Ap  e  MMxp  is  significantly  reduced.  When  this 
assumption  is  not  true,  that  is  when  p  is  large,  the  storage  requirements  for  Ap  sig¬ 
nificantly  increases,  and  the  code  becomes  less  efficient.  Table  2  clearly  shows  the 
PLH  approach  is  much  better  when  the  assumption  that  p  <C  N  is  true. 

A  scalability  study  was  performed  using  a  random,  positive  matrix  of  size  65, 536  x 
131,072  and  a  random,  positive  right-hand  side.  The  problem  was  chosen  so  that 
p  <C  IV  at  the  optimal  solution.  Figures  la  and  lb  show  the  wall  clock  times  and 
speedup,  respectively,  versus  processor  count.  Here  speedup,  Sp,  is  defined  as 

sp  =  L  (31) 

-L  n 

where  Tj  is  the  sequential  time  and  Tn  is  the  time  using  n  processors.  These  prob¬ 
lems  were  too  large  to  fit  on  a  single  node,  so  Tj  was  estimated  by  Tx  =  m  Tm, 
where  m  is  the  smallest  number  of  processors  used,  in  this  case  8.  This  puts  the  first 
point  on  the  ideal  line. 

The  results  show  the  PLH  code  achieves  near  linear  speedup  and  the  PQN  codes 
achieve  reasonable  speedup  but  not  ideal.  Despite  lower  scalability,  for  this  problem 
and  at  these  processor  counts  the  PQN  codes  are  still  faster  than  the  PLH  code,  and 
significantly  so  for  the  limited  versions. 

The  PLH  and  PQN  codes  were  integrated  into  the  ECSW  hyper-reduction  procedure 
of  the  Farhat  Research  Group8  "aeros"  code  and  used  in  constructing  reduced-order 
models  of  a  generic  V-hull  vehicle,  shown  in  Fig.  2,  subjected  to  an  under-body 
blast.  The  high-dimensional,  finite  element  model  has  236,995  flexible  shell  and 
rigid  beam  elements  with  6  degrees  of  freedom  per  node.  The  aeros  code  is  also 
used  to  obtain  the  high-fidelity  solution.  The  response  of  the  vehicle  to  an  under- 
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body  blast  is  computed  out  to  time  1.0  x  1CU3.  For  the  purposes  of  building  the 
reduced-order  model,  200  snapshots  were  collected  at  uniform  time  intervals  over 
this  period  and  are  used  to  construct  a  ROM  with  100  basis  vectors  using  a  proper 
orthogonal  decomposition.  With  200  snapshots  and  100  basis  vectors  the  size  of 
the  ECWS  matrix  is  20,  000  x  236,  995.  Using  tolerances  of  r  =  .1  and  r  =  .01, 
reduced  meshes  and  associated  reduced-order  models  are  constructed.  Scalability 
results  and  a  comparison  of  solutions  from  the  resulting  reduce  order  models  to  the 
high-fidelity  results  are  presented. 


(a)  Wall  Clock  Time 


Fig.  1  Scalability  study  using  a  65,536  x  131,072  random  matrix.  Each  calculation  uses  a 
single  MPI  process  per  node.  n'"'a:r'  —  1,000  for  each  LPQN  and  LPN  calculation.  The  num¬ 
ber  of  free  variables  in  the  solution  is  821.  The  relative  error  between  the  PLH  and  PQN 
solutions  isO(10_7)or  better. 


Fig.  2  V-hull  finite  element  model. 


Figure  3  shows  the  results  of  a  scalability  study  using  the  generic  V-hull  vehicle 
ECSW  matrix  and  r  =  .1.  The  PLH  method  was  applied  to  both  the  scaled  and  un¬ 
sealed  matrix  to  highlight  the  performance  differences.  To  understand  the  character 
of  this  matrix,  its  singular  values  when  scaled  and  unsealed  are  plotted  in  Fig.  4.  The 
PLH  data  show  a  significant  increase  in  compute  time  and  a  decrease  in  scalability 
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for  the  unsealed  case.  The  primary  reason  for  this  increase  is  the  downdate  proce¬ 
dure  is  executed  many  more  times  when  the  matrix  is  unsealed  then  scaled.  The  data 
therefore  indicates  the  scalability  of  the  downdate  procedure  is  not  as  good  as  the 
main  iteration  loop.  The  number  in  parenthesis  beside  the  figure  labels  is  the  size 
of  the  reduced  mesh  generated,  or  equivalently,  the  nonzero  entries  in  the  NNLS 
solution. 


(a)  Wall  Clock  Time 


(b) Speedup 


Fig.  3  Scalability  study  using  ECSW  matrix  from  ROM  of  the  generic  vehicle. 


Fig.  4  Singular  values  for  the  ECSW  matrix  for  the  scaled  and  unsealed  generic  Y-hull  prob¬ 
lem. 


Scaling,  or  preconditioning,  is  generally  a  good  and  sometimes  necessary  step. 
There  is,  however,  a  question  as  to  whether  or  not  it  is  appropriate  for  ECSW  hyper¬ 
reduction.  A  finite  element  model  might  consist  of  materials  with  widely  varying 
properties  and  scaling  might  bias  the  construction  of  the  reduced  mesh  in  a  way 
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not  intended  or  wanted.  The  answer  to  this  is  not  clear  and  not  addressed  here.  It  is 
pointed  out  because  if  scaling  is  not  desired,  the  current  implementation  of  the  PQN 
method  would  not  be  usable  on  this  particular  ECSW  matrix  because  it  performs 
poorly  when  not  scaled.  Consequently,  no  results  are  presented  for  the  PQN  method 
for  the  unsealed  matrix. 

For  the  scaled  matrix,  the  results  in  Fig.  3  show  the  PQN  method  is  faster  than 
the  PFH  method,  but  the  implementation  is  not  as  scalable.  The  degradation  in 
scalability  is  more  pronounced  here  then  in  the  larger,  random  matrix  presented 
before.  This  could  be  due  in  part  to  M  =  20,  000  being  smaller. 

Because  the  PQN  method  is  not  appropriate  for  hyper-reduction  without  limiting  the 
active  set,  a  global  limit  and  a  per  iteration  limit  are  imposed.  The  global  limit  was 
set  to  3,200  free  variables,  and  the  per  iteration  limit  was  set  to  3  plus  the  limit  at  the 
previous  iteration.  It  was  found  that  if  just  a  global  limit  is  imposed,  that  limit,  as 
expected,  is  reached  immediately,  but  the  iteration  stalls  trying  to  reach  the  specified 
tolerance,  that  is,  the  residual  rate  of  decrease  drops  significantly.  Imposing  a  per 
iteration  limit  improved  convergence.  The  global  limit  of  3,200  was  reached  in  the 
r  =  .01  case.  In  both  cases,  PQN  produced  larger  reduced  meshes  than  PFH. 

Figures  5  and  6  show  the  results  of  the  displacements  and  velocities  computed  with 
reduced-order  models  constructed  with  results  form  PFH  and  PQN  methods  and 
also  shows  the  corresponding  results  from  the  high-fidelity  model.  The  results  are 
from  an  element  on  the  bottom  surface  of  the  vehicle  near  the  explosion.  Table  6 
shows  the  reduced  mesh  sizes  associated  with  the  the  difference  tolerances  and 
methods  presented  it  these  figures.  Two  tolerances  are  used  in  the  ECSW  step:  the 
figures  on  the  left  are  the  results  with  r  =  .1  and  the  ones  on  the  right  are  with 
r  =  .01.  In  this  case,  both  tolerances  are  acceptable.  With  r  =  .01  the  error  over 
the  entire  domain  was  computed  to  be  less  than  2%. 
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Fig.  5  Displacements  at  an  element  on  the  vehicle  near  the  explosion  computed  by  a  reduced- 
order  model  constructed  using  PLH  and  LPQN  compared  to  the  high-fidelity  model.  Left  plots 
used  T—  0.1  and  right  plots  used  r=  .01. 
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Fig.  6  Velocities  at  an  element  on  the  vehicle  near  the  explosion  computed  by  a  reduced-order 
model  constructed  using  PLH  and  LPQN  methods  compared  to  the  high-fidelity  model.  Left 
plots  used  r=  0.1  and  right  plots  used  r=  .01. 


Table  6  Reduced  mesh  sizes  produced  for  each  solver  in  the  ECSW  hyper-reduction  step. 


Method 

T  =1 

T=  .01 

plh 

902 

2,826 

lpqn3 

1,017 

3,200 
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5.  Conclusions 


When  solving  for  optimal  solutions,  the  PLH  code  displays  the  best  scalability  and 
performs  very  well  when  the  design  assumption  that  the  number  of  free  variables 
in  the  solution  is  significantly  less  than  the  column  dimension  of  the  matrix  is  met. 
It  also  outperforms  the  PQN  code  when  the  matrix  is  ill-conditioned  and  scaling, 
or  preconditioning,  is  not  done.  However,  in  most  situations  the  PQN  code  outper¬ 
forms  the  PLH  code,  and  in  many  cases  significantly  so,  particularly  if  the  design 
assumption  is  violated.  PQN  is  therefore  the  method  of  choice  for  well-behaved 
NNLS  problems. 

When  applied  to  ECSW  hyper-reduction  for  the  under  body  blast  problem,  the  re¬ 
sults  from  the  2  codes  are  mixed.  If  the  matrix  is  scaled,  the  PQN  method  with  cer¬ 
tain  free  variable  limiting  parameters  produced  solutions  significantly  faster  than 
the  PLH  code;  however,  in  all  cases  the  PLH  code  produced  the  sparsest  solution. 
When  the  matrix  is  not  scaled,  convergence  of  the  PQN  method  is  either  too  slow 
or  nonexistent  and  PLH  is  the  only  option.  Scalability  of  the  PLH  code  on  this 
problem  is  again  better  than  the  PQN  code.  It  is  even  more  pronounced  here  but  a 
contributing  factor  is  most  likely  due  in  part  to  the  relatively  small  row  dimension 
(20,000)  of  matrix. 

Limiting  the  number  of  free  variables  in  the  PQN  method  for  the  ECSW  problem 
is  absolutely  necessary,  but  determining  the  limit  per  iteration  and  the  global  limit 
is  not  straightforward.  The  generic  vehicle  results  presented  used  a  growth  factor 
of  3  free  variables  per  iteration  with  a  maximum  of  3,200.  This  produced  solutions 
significantly  faster  than  PLH,  but  some  trial  and  error  is  required  to  do  this.  On 
the  other  hand,  solving  for  the  optimal  solution  of  well-behaved  NNLS  problems 
benefited  greatly  from  limiting  and  only  a  global  limit  needed  to  be  defined.  Some 
prior  knowledge  or  some  trial  and  error  may  be  needed  to  be  sure  the  global  limit 
does  not  interfere  with  finding  the  optimal  solution,  but  it  can  be  large  and  still 
beneficial.  This  is  not  as  restrictive  as  in  the  hyper-reduction  case. 

The  PLH  code  is  the  best  choice  for  use  with  ECSW  hyper-reduction  when  the 
system  is  large  and  ill-conditioned  and  the  PQN  is  best  for  finding  the  solutions  for 
well-behaved  NNLS  problems.  More  work  clearly  needs  to  be  done  to  improve  the 
scalability  of  the  PQN  code  to  make  it  competitive  on  very  large  problems.  Even 
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so,  with  or  without  limiting  it  is  the  best  method  for  finding  optimal  solutions  on 
many  size  problems.  Additional  work  needs  to  be  done  to  make  it  competitive  for 
hyper-reduction. 
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List  of  Symbols,  Abbreviations,  and  Acronyms 


AHPCRC 

Army  High  Performance  Computing  Research  Center 

BFGS 

Broyden-Fletcher-Goldfarb-Shanno 

ECSW 

energy  conserving  sampling  and  weighting 

KKT 

Karu  sh-  Kuhn-Tucker 

L-BFGS 

Limited  memory  Broyden-Fletcher-Goldfarb-Shanno 

LH 

Lawson  and  Hanson 

LPN 

limited  projected  Newton 

LPQN 

limited  projected  quasi-Newton 

NNLS 

nonnegative  least  squares 

PLH 

parallel  Lawson  and  Hanson 

PN 

projected  Newton 

PQN 

projected  quasi-Newton 

ROM 

reduced-order  model 
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